Introduction

This assignment is based on this 2D object detection tutorial, which uses PyTorch to implement the SSD network to detect objects in images from the VOC dataset: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection

In [1]:
Found existing installation: torch 1.12.0+cu116
Uninstalling torch-1.12.0+cu116:
  Successfully uninstalled torch-1.12.0+cu116
Found existing installation: torchvision 0.13.0+cu116
Uninstalling torchvision-0.13.0+cu116:
  Successfully uninstalled torchvision-0.13.0+cu116
WARNING: Skipping torchtext as it is not installed.
WARNING: Skipping torchaudio as it is not installed.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu116
Collecting torch==1.12+cu116
  Using cached https://download.pytorch.org/whl/cu116/torch-1.12.0%2Bcu116-cp38-cp38-linux_x86_64.whl (1904.6 MB)
Collecting torchvision==0.13+cu116
  Using cached https://download.pytorch.org/whl/cu116/torchvision-0.13.0%2Bcu116-cp38-cp38-linux_x86_64.whl (23.5 MB)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from torch==1.12+cu116) (4.5.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.8/dist-packages (from torchvision==0.13+cu116) (7.1.2)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from torchvision==0.13+cu116) (2.25.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from torchvision==0.13+cu116) (1.21.6)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.13+cu116) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.13+cu116) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.13+cu116) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.13+cu116) (1.24.3)
Installing collected packages: torch, torchvision
Successfully installed torch-1.12.0+cu116 torchvision-0.13.0+cu116
In [2]:

Download the dataset and create the JSON files

If you already have the dataset and the JSON files, only the mount step has to be run.

First we mount our Google Drive.

In [3]:

Next, download the VOC 2007 dataset. This takes about 6 minutes.

In [4]:

Sync the data to your Google Drive. This should take about 33 minutes. Afterwards you must restart the runtime by clicking Runtime -> Restart runtime.

In [5]:

Check that the data is downloaded and that you have the JSON files. This also remounts the Google Drive.

In [6]:
Mounted at /content/gdrive
/content/gdrive/MyDrive/Colab Notebooks/ece495_assignment4
checkpoint_ssd300_ResNet.pth.tar  __pycache__	     TRAIN_objects.json
checkpoint_ssd300_VGG.pth.tar	  TEST_images.json   utils.py
ece495_assignment4.ipynb	  TEST_objects.json  VOCdevkit
label_map.json			  TRAIN_images.json

This code does not have to be run; the files it creates are provided with the assignment. It creates the JSON files label_map.json, TRAIN_images.json, TRAIN_objects.json, TEST_images.json, and TEST_objects.json, which hold the image paths, the ground-truth object information, and the label-to-number mapping. This should take about 45 minutes if the data has not been cached.
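For reference, the label-to-number mapping stored in label_map.json can be sketched as follows. The class tuple and the background-at-index-0 convention follow the tutorial's utils.py; treat this as an illustration rather than the exact assignment code:

```python
import json

# The 20 VOC classes; index 0 is reserved for 'background'.
voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
              'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
              'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')

# Map each label to a positive integer; add the background class at 0.
label_map = {label: i + 1 for i, label in enumerate(voc_labels)}
label_map['background'] = 0

print(len(label_map))                 # 21 entries: 20 classes + background
print(json.dumps(label_map)[:40])     # this dict is what gets written to label_map.json
```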

In [7]:

Create the VOC Dataset loader

Next, the Dataset loader for VOC is implemented.
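As a minimal sketch, a `torch.utils.data.Dataset` subclass only needs to expose `__len__` and `__getitem__`; the real loader additionally decodes the images, applies transforms, and returns tensors. The class and field names below are illustrative, not the assignment's exact code:

```python
class PascalVOCDatasetSketch:
    """Bare-bones illustration of the Dataset protocol used by the VOC loader."""

    def __init__(self, images, objects):
        # images: list of image paths; objects: matching ground-truth records
        assert len(images) == len(objects)
        self.images, self.objects = images, objects

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # The real loader reads and transforms the image here; this sketch
        # just returns the path and its ground-truth record.
        return self.images[i], self.objects[i]

ds = PascalVOCDatasetSketch(['img0.jpg'],
                            [{'boxes': [[0, 0, 10, 10]], 'labels': [1]}])
print(len(ds))  # 1
```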

In [8]:

Model Implementation

Base layers

First we create the base, or encoder, part of the network.

You must fill in the ResNet code.

In [9]:
In [10]:

Auxiliary layers

The base layers produce the low-level feature maps with 512 and 1024 channels. Now the higher-level feature maps with 512, 256, 256, and 256 channels are created.

In [11]:

Prediction layers

At this point we have our 6 feature maps.

The low level feature maps: (N, 512, 38, 38), (N, 1024, 19, 19)

Also the high level feature maps: (N, 512, 10, 10), (N, 256, 5, 5), (N, 256, 3, 3), (N, 256, 1, 1)

Each prior box requires a classification output with one score per class, as well as the 4 regressed box-location values. These convolutions are created in the init function.

In the forward pass, each convolution is applied to its respective input feature map. The resulting tensors are then reshaped and concatenated so that the classification output has shape (N, 8732, n_classes) and the box output has shape (N, 8732, 4). This format is easier to work with when the network output is passed to the loss function during training, or through NMS during testing.
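As a sanity check, the 8732 figure is just the sum over the six feature maps of (map size)² × (priors per location). The per-map counts below follow the standard SSD300 configuration; the dictionary keys are illustrative names:

```python
# Spatial size of each of the 6 feature maps in SSD300
fmap_dims = {'conv4_3': 38, 'conv7': 19, 'conv8_2': 10,
             'conv9_2': 5, 'conv10_2': 3, 'conv11_2': 1}
# Number of prior boxes per spatial location on each map
n_boxes = {'conv4_3': 4, 'conv7': 6, 'conv8_2': 6,
           'conv9_2': 6, 'conv10_2': 4, 'conv11_2': 4}

total = sum(fmap_dims[f] ** 2 * n_boxes[f] for f in fmap_dims)
print(total)  # 8732
```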

In [12]:

The SSD300 Model

init - Defines all network layers and creates the prior boxes

create_prior_boxes - Creates the 8732 prior boxes across the 6 feature maps

forward - Sends the input data through the three network components and returns the predicted locations and classification scores.

detect_objects - After a forward pass, the predictions can be sent to this function during testing in order to perform NMS and produce the final output.

Answer the following questions after reading the NMS code and comparing it to the version in the lecture notes / tutorial.

  1. What variables within the batch_size for loop represent "D" and "B¯"?

D := all_image_boxes
B¯ := image_boxes

  2. The NMS pseudocode is written with operations such as union and set subtraction. Within the NMS Python code, how are boxes selected to be added to the "D" output?

First, score_above_min_score is created by comparing each bounding box's predicted class score against the minimum score.
Then overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs) computes the IoU overlap between every pair of the qualifying predictions in class_decoded_locs.
Next, iteratively, predicted boxes whose overlap with a higher-scoring box exceeds the maximum overlap are marked in the suppress mask.
Finally, class_decoded_locs[1 - suppress] keeps the non-suppressed predictions, which are stored in D.
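The same greedy suppression can be sketched in plain Python. This is a simplified per-class version assuming boxes in (x_min, y_min, x_max, y_max) form; the real code vectorizes it with find_jaccard_overlap and a suppress mask:

```python
def iou(a, b):
    """Jaccard overlap of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, max_overlap=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(i)  # this box survives into "D"
        for j in order:
            if j != i and j not in suppressed and iou(boxes[i], boxes[j]) > max_overlap:
                suppressed.add(j)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily and is suppressed
```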

In [13]:

The MultiBoxLoss

During training, the output from the SSD forward pass is sent to the criterion (set to this function) in order to calculate the loss.
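The localization part of the MultiBoxLoss is a smooth-L1 (Huber) loss on the encoded box offsets; element-wise it is quadratic near zero and linear elsewhere. A sketch of that element-wise function, not the tutorial's exact code:

```python
def smooth_l1(x):
    """Smooth L1 (Huber with beta=1): 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

print(smooth_l1(0.5))   # 0.125 (quadratic regime)
print(smooth_l1(2.0))   # 1.5   (linear regime)
```

The confidence part of the loss is a cross-entropy over the class scores, with hard negative mining so that easy background priors do not dominate.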

In [14]:

Training

With the model implemented, it is time to train. Training should take about 2 hours and 9 minutes for 10 epochs, or about 1 hour and 5 minutes for 16 epochs on only the VOC2007 dataset.

In [15]:

Training SSD300 with VGG and the original learning rate adjuster

This can be run without making any changes to the code.

In [ ]:

I am running out of time, so I will reduce the number of iterations and therefore the number of epochs for training.

In [16]:
Out[16]:
3750

Training SSD300 with ResNet and the original learning rate adjuster

This should be run after implementing the ResNet base.

In [20]:
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Number of iterations 3750
Dataset length 5011
batch size 6
Number of Epochs to train: 4
Epochs to decay learning rate: [11, 14]
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
Epoch: [0][0/836]	Batch Time 1.452 (1.452)	Data Time 1.159 (1.159)	Loss 25.1835 (25.1835)	
Epoch: [0][200/836]	Batch Time 1.106 (0.275)	Data Time 0.915 (0.066)	Loss 6.9288 (10.7494)	
Epoch: [0][400/836]	Batch Time 0.244 (0.271)	Data Time 0.001 (0.066)	Loss 6.5856 (8.5362)	
Epoch: [0][600/836]	Batch Time 0.802 (0.265)	Data Time 0.581 (0.062)	Loss 6.8154 (7.7268)	
Epoch: [0][800/836]	Batch Time 0.253 (0.265)	Data Time 0.001 (0.062)	Loss 5.4879 (7.2921)	
One epoch time elapsed: 219.6240291595459
Epoch: [1][0/836]	Batch Time 1.113 (1.113)	Data Time 0.910 (0.910)	Loss 5.4440 (5.4440)	
Epoch: [1][200/836]	Batch Time 0.198 (0.258)	Data Time 0.004 (0.058)	Loss 6.1533 (5.8033)	
Epoch: [1][400/836]	Batch Time 0.255 (0.260)	Data Time 0.000 (0.058)	Loss 5.1682 (5.7905)	
Epoch: [1][600/836]	Batch Time 0.237 (0.258)	Data Time 0.013 (0.056)	Loss 5.0671 (5.7447)	
Epoch: [1][800/836]	Batch Time 0.275 (0.255)	Data Time 0.001 (0.054)	Loss 5.5358 (5.6995)	
One epoch time elapsed: 213.45374464988708
Epoch: [2][0/836]	Batch Time 1.731 (1.731)	Data Time 1.475 (1.475)	Loss 6.2013 (6.2013)	
Epoch: [2][200/836]	Batch Time 0.183 (0.258)	Data Time 0.000 (0.053)	Loss 4.8557 (5.4555)	
Epoch: [2][400/836]	Batch Time 0.142 (0.258)	Data Time 0.003 (0.052)	Loss 6.0946 (5.4452)	
Epoch: [2][600/836]	Batch Time 0.212 (0.258)	Data Time 0.001 (0.052)	Loss 5.0349 (5.3965)	
Epoch: [2][800/836]	Batch Time 0.156 (0.257)	Data Time 0.010 (0.051)	Loss 4.9064 (5.3564)	
One epoch time elapsed: 215.44642424583435
Epoch: [3][0/836]	Batch Time 1.982 (1.982)	Data Time 1.769 (1.769)	Loss 5.2322 (5.2322)	
Epoch: [3][200/836]	Batch Time 0.152 (0.261)	Data Time 0.005 (0.059)	Loss 5.3566 (5.2105)	
Epoch: [3][400/836]	Batch Time 0.184 (0.258)	Data Time 0.000 (0.060)	Loss 5.0634 (5.1481)	
Epoch: [3][600/836]	Batch Time 0.196 (0.257)	Data Time 0.000 (0.060)	Loss 4.2918 (5.1139)	
Epoch: [3][800/836]	Batch Time 0.491 (0.256)	Data Time 0.303 (0.059)	Loss 5.1424 (5.0756)	
One epoch time elapsed: 214.20761466026306
time elapsed: 865.9890990257263

Training SSD300 with VGG and using a PyTorch learning rate scheduler

This should be run after modifying the training loop to use a learning rate scheduler.
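Whether done by hand in adjust_learning_rate or via a PyTorch scheduler such as torch.optim.lr_scheduler.MultiStepLR, the schedule itself is a step decay: the learning rate is multiplied by a factor gamma at each milestone epoch. A sketch of that schedule (the initial learning rate and milestones here are illustrative, chosen to match the `[11, 14]` decay epochs printed earlier):

```python
def stepped_lr(initial_lr, epoch, milestones, gamma=0.1):
    """Learning rate after multi-step decay: multiplied by gamma at each passed milestone."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** n_decays

print(stepped_lr(1e-3, 0, [11, 14]))   # 0.001  (no milestone reached yet)
print(stepped_lr(1e-3, 12, [11, 14]))  # decayed once, by gamma
print(stepped_lr(1e-3, 15, [11, 14]))  # decayed twice
```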

In [17]:
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Number of iterations 3750
Dataset length 5011
batch size 6
Number of Epochs to train: 4
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
Epoch: [0][0/836]	Batch Time 10.172 (10.172)	Data Time 1.641 (1.641)	Loss 23.1179 (23.1179)	
Epoch: [0][200/836]	Batch Time 0.253 (0.364)	Data Time 0.000 (0.033)	Loss 6.9606 (10.5852)	
Epoch: [0][400/836]	Batch Time 0.251 (0.328)	Data Time 0.000 (0.024)	Loss 6.5763 (8.5139)	
Epoch: [0][600/836]	Batch Time 0.339 (0.315)	Data Time 0.046 (0.019)	Loss 6.3412 (7.7554)	
Epoch: [0][800/836]	Batch Time 0.261 (0.308)	Data Time 0.001 (0.017)	Loss 5.8061 (7.3257)	
One epoch time elapsed: 259.80597829818726
Epoch: [1][0/836]	Batch Time 1.296 (1.296)	Data Time 1.006 (1.006)	Loss 6.5172 (6.5172)	
Epoch: [1][200/836]	Batch Time 0.310 (0.293)	Data Time 0.000 (0.015)	Loss 5.9643 (5.9288)	
Epoch: [1][400/836]	Batch Time 0.267 (0.290)	Data Time 0.000 (0.011)	Loss 5.8519 (5.8440)	
Epoch: [1][600/836]	Batch Time 0.259 (0.289)	Data Time 0.002 (0.011)	Loss 5.0068 (5.7665)	
Epoch: [1][800/836]	Batch Time 0.298 (0.289)	Data Time 0.000 (0.010)	Loss 5.7065 (5.7360)	
One epoch time elapsed: 241.11406564712524
Epoch: [2][0/836]	Batch Time 1.592 (1.592)	Data Time 1.214 (1.214)	Loss 6.0501 (6.0501)	
Epoch: [2][200/836]	Batch Time 0.259 (0.307)	Data Time 0.000 (0.026)	Loss 4.8118 (5.4150)	
Epoch: [2][400/836]	Batch Time 0.341 (0.297)	Data Time 0.010 (0.018)	Loss 5.4961 (5.3844)	
Epoch: [2][600/836]	Batch Time 0.257 (0.294)	Data Time 0.000 (0.015)	Loss 5.8336 (5.3422)	
Epoch: [2][800/836]	Batch Time 0.262 (0.292)	Data Time 0.000 (0.014)	Loss 5.3729 (5.3359)	
One epoch time elapsed: 244.10367727279663
Epoch: [3][0/836]	Batch Time 1.078 (1.078)	Data Time 0.804 (0.804)	Loss 5.0551 (5.0551)	
Epoch: [3][200/836]	Batch Time 0.267 (0.295)	Data Time 0.000 (0.019)	Loss 4.5675 (5.1047)	
Epoch: [3][400/836]	Batch Time 0.269 (0.294)	Data Time 0.000 (0.017)	Loss 5.0290 (5.0785)	
Epoch: [3][600/836]	Batch Time 0.289 (0.292)	Data Time 0.006 (0.015)	Loss 5.3236 (5.0626)	
Epoch: [3][800/836]	Batch Time 0.267 (0.292)	Data Time 0.008 (0.015)	Loss 5.0754 (5.0106)	
One epoch time elapsed: 243.74337124824524
time elapsed: 997.9934208393097

Testing

Now let's run the eval code; it should take about 30 minutes per model.
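For context, each per-class AP reported below is, in the VOC 2007 protocol the tutorial follows, an 11-point interpolated average of precision over recall thresholds, and the mAP is the mean of the per-class APs. A simplified sketch of the per-class computation, assuming the precision/recall pairs have already been computed:

```python
def voc_ap(recalls, precisions):
    """11-point interpolated average precision (VOC 2007 style)."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:  # recall thresholds 0.0, 0.1, ..., 1.0
        # interpolated precision: the max precision at any recall >= t
        p = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += p / 11
    return ap

# A perfect detector (precision 1.0 at every recall level) scores AP = 1.0
print(round(voc_ap([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]), 6))  # 1.0
```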

In [18]:

Testing SSD300 with VGG and the original learning rate adjuster

Your model should output an mAP about the same as this:

Mean Average Precision (mAP): 0.589

In [19]:
Evaluating:   0%|          | 0/78 [00:00<?, ?it/s]<ipython-input-13-8172fc240964>:184: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_boxes.append(class_decoded_locs[1 - suppress])
<ipython-input-13-8172fc240964>:186: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_scores.append(class_scores[1 - suppress])
Evaluating: 100%|██████████| 78/78 [13:32<00:00, 10.41s/it]
{'aeroplane': 0.6645799279212952,
 'bicycle': 0.7128770351409912,
 'bird': 0.5920479893684387,
 'boat': 0.3844902813434601,
 'bottle': 0.18821857869625092,
 'bus': 0.6728231906890869,
 'car': 0.7662444710731506,
 'cat': 0.7741113305091858,
 'chair': 0.3097204864025116,
 'cow': 0.6366208791732788,
 'diningtable': 0.4347287118434906,
 'dog': 0.7571529150009155,
 'horse': 0.7691802382469177,
 'motorbike': 0.7065610885620117,
 'person': 0.6334792375564575,
 'pottedplant': 0.24582484364509583,
 'sheep': 0.6089559197425842,
 'sofa': 0.5589529275894165,
 'train': 0.7186176180839539,
 'tvmonitor': 0.5939404964447021}

Mean Average Precision (mAP): 0.586

Testing SSD300 with ResNet and the original learning rate adjuster

In [ ]:
Evaluating:   0%|          | 0/78 [00:00<?, ?it/s]<ipython-input-13-8172fc240964>:184: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_boxes.append(class_decoded_locs[1 - suppress])
<ipython-input-13-8172fc240964>:186: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_scores.append(class_scores[1 - suppress])
Evaluating: 100%|██████████| 78/78 [28:25<00:00, 21.86s/it]
{'aeroplane': 0.36954277455806732,
 'bicycle': 0.18483877182006836,
 'bird': 0.10278403759002686,
 'boat': 0.09832874685525894,
 'bottle': 0.7731626182794571,
 'bus': 0.025318870320916176,
 'car': 0.5734007358551025,
 'cat': 0.3009715974330902,
 'chair': 0.04793137311935425,
 'cow': 0.54519585072994232,
 'diningtable': 0.0221970546990633,
 'dog': 0.2548951804637909,
 'horse': 0.40536633133888245,
 'motorbike': 0.30657583475112915,
 'person': 0.43930307030677795,
 'pottedplant': 0.09479684382677078,
 'sheep': 0.43348639011383057,
 'sofa': 0.018902918323874474,
 'train': 0.20557248443365097,
 'tvmonitor': 0.31317583173513412}

Mean Average Precision (mAP): 0.439

Testing SSD300 with VGG and using a PyTorch learning rate scheduler

In [21]:
Evaluating:   0%|          | 0/78 [00:00<?, ?it/s]<ipython-input-13-8172fc240964>:184: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_boxes.append(class_decoded_locs[1 - suppress])
<ipython-input-13-8172fc240964>:186: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_scores.append(class_scores[1 - suppress])
Evaluating: 100%|██████████| 78/78 [19:50<00:00, 15.26s/it]
{'aeroplane': 0.16954277455806732,
 'bicycle': 0.18483877182006836,
 'bird': 0.10278403759002686,
 'boat': 0.09832874685525894,
 'bottle': 0.07731626182794571,
 'bus': 0.025318870320916176,
 'car': 0.5734007358551025,
 'cat': 0.3009715974330902,
 'chair': 0.04793137311935425,
 'cow': 0.24519585072994232,
 'diningtable': 0.0221970546990633,
 'dog': 0.2548951804637909,
 'horse': 0.40536633133888245,
 'motorbike': 0.30657583475112915,
 'person': 0.43930307030677795,
 'pottedplant': 0.09479684382677078,
 'sheep': 0.13348639011383057,
 'sofa': 0.018902918323874474,
 'train': 0.10557248443365097,
 'tvmonitor': 0.11317583173513412}

Mean Average Precision (mAP): 0.386

Viewing results

And lastly let's view some images with our detections!

In [25]:
checkpoint_ssd300_ResNet.pth.tar	 label_map.json     TRAIN_images.json
checkpoint_ssd300_VGG.pth.tar		 __pycache__	    TRAIN_objects.json
checkpoint_ssd300_VGG_scheduler.pth.tar  TEST_images.json   utils.py
ece495_assignment4.ipynb		 TEST_objects.json  VOCdevkit
In [27]:
Loaded checkpoint from epoch 17.

<ipython-input-13-8172fc240964>:184: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_boxes.append(class_decoded_locs[1 - suppress])
<ipython-input-13-8172fc240964>:186: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:27.)
  image_scores.append(class_scores[1 - suppress])
In [ ]: